Imagine this: one morning you, a data analyst, receive a new dataset, and by the afternoon you need to present some insights from it to your supervisor. Unfortunately, this happens sometimes. The best way to present insights is, of course, with visualisation, and luckily there is a visualisation library called plotly-express that can help. Plotly Express is a terse, consistent, high-level API for rapid data exploration and figure generation. It is meant to help you visualise your data quickly and easily. So I created this course book to demonstrate how to use this library to explore your data and quickly answer some of your business questions.
The dataset we use consists of the marks secured by students in various subjects, and is available on Kaggle as Student Performance in Exams.
The inspiration is to understand the influence of the parents' background, test preparation, etc. on students' performance. It comprises 1,000 rows and 8 columns.
The libraries we use today are pandas and plotly_express. You can install the latter with pip install plotly-express.
import pandas as pd
import plotly_express as px
As usual, we start by reading the data. Here I rename the columns to names that are easier to use and take a look at the first few rows.
df = pd.read_csv('data_input/StudentsPerformance.csv')
df.columns = ['gender', 'ethnicity', 'parental_level_of_education','lunch','test_preparation_course','math','reading','writing']
df.head()
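Assigning a list to df.columns depends on the columns arriving in exactly that order. A minimal sketch of a more defensive alternative is to rename through a mapping. The original header names below are an assumption about the raw Kaggle CSV, and the two rows are made up for illustration.

```python
import pandas as pd

# Toy frame; the header names are ASSUMED to match the raw Kaggle CSV
# and the two rows are invented for illustration.
raw = pd.DataFrame({
    "gender": ["female", "male"],
    "race/ethnicity": ["group B", "group C"],
    "parental level of education": ["bachelor's degree", "some college"],
    "lunch": ["standard", "free/reduced"],
    "test preparation course": ["none", "completed"],
    "math score": [72, 69],
    "reading score": [72, 90],
    "writing score": [74, 88],
})

# Map old names to new names explicitly, so column order no longer matters.
column_map = {
    "race/ethnicity": "ethnicity",
    "parental level of education": "parental_level_of_education",
    "test preparation course": "test_preparation_course",
    "math score": "math",
    "reading score": "reading",
    "writing score": "writing",
}
renamed = raw.rename(columns=column_map)
print(list(renamed.columns))
```

Columns that are missing from the mapping, such as gender and lunch, simply keep their names.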
Here we check the data type of each column.
df.dtypes
As usual, let's first look at the distribution of our categorical data.
print(df.gender.value_counts(),"\n\n",
df.lunch.value_counts(),"\n\n",
df.ethnicity.value_counts(),"\n\n",
df.parental_level_of_education.value_counts(),"\n\n",
df.test_preparation_course.value_counts(),
sep='')
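The same counts can also be produced with a loop, which is easier to extend when there are more columns. This is only a sketch on made-up rows; in the notebook you would loop over the categorical columns of df instead of the hypothetical toy frame.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
toy = pd.DataFrame({
    "gender": ["female", "male", "female", "female"],
    "lunch": ["standard", "standard", "free/reduced", "standard"],
})

# List the categorical columns explicitly and count each one.
categorical_columns = ["gender", "lunch"]
counts = {col: toy[col].value_counts() for col in categorical_columns}

for col, series in counts.items():
    print(series, end="\n\n")
```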
From the distribution of our categorical columns, here are some insights we can take away:
Next, let's look at the distribution of our numeric columns.
df.describe()
From the distribution of our numeric columns, here are some insights we can take away:
After this first exploration, there are a couple of questions we can answer with this data. For the demo, let's answer these two: does a certain gender excel in a certain subject, and is there a specific ethnicity that is better at math?
Before we start answering the questions, let's first look at the distribution of our subjects (gender and ethnicity). The first plot we use from the library is the bar plot, one of the most effective plots for answering a lot of questions. In Plotly Express we can use bar(dataframe, x, y). The only parameters you need are the dataframe and x.
fig = px.bar(df, x= 'gender')
fig.show(renderer="notebook")
That's how you make a bar plot. Our first objective is to see the distribution of ethnicity and gender. We can pass ethnicity as color to differentiate the groups.
fig = px.bar(df,
x= 'gender',
color='ethnicity')
fig.show(renderer="notebook")
We can't get a clear picture from that plot because the bars are stacked. We can change this with the barmode parameter. The default is 'relative'; to unstack the bars we can use 'group'.
fig = px.bar(df,
x= 'gender',
color='ethnicity',
barmode='group')
fig.show(renderer="notebook")
Now we can see the distribution, but let's beautify our plot a bit. First, you can give the plot a theme with the template parameter. You can use one of the built-in templates, for example plotly_dark or plotly_white; check the documentation for more themes. Here I'll use my favorite, plotly_white.
Then, if you noticed, the ethnicity groups are not in order. We can reorder them with the category_orders parameter, which accepts a dictionary mapping a column name to the desired order of its values. The categories are ordered accordingly, as long as the names you give match the values in the column.
Lastly, you can always give your plot a title.
fig = px.bar(df,
x= 'gender',
color='ethnicity',
template='plotly_white',
barmode='group',
category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
title= "Ethnicity Distribution on Gender")
fig.show(renderer="notebook")
You could actually write each of those plots on one line, but I added line breaks to make them more readable. As we can see from the plot, female and male have quite similar distributions, and ethnicity group C dominates our dataset. With that distribution, it's safe to assume we can analyze the genders in our dataset equally.
So, let's answer our first question: does a certain gender excel in a certain subject?
To answer this question we will take the math and reading subjects. Why? Because I like both subjects. Just kidding, I picked math and reading because they are the subjects with the lowest and the highest average. To answer the question we can use a scatter plot; yes, conveniently we can use other plot types too. To make a scatter plot, as you can guess, we use the scatter function. We put math and reading on x and y, then color the points by gender so we can see whether there is a difference, and as usual I'll use the plotly_white template.
fig = px.scatter(df,
x='math',
y='reading',
color ='gender',
template='plotly_white',
title="Does a certain gender excel in a certain subject?")
fig.show(renderer="notebook")
We can already read the answer off the plot, but before answering let's add marginal plots to see the distribution of the scores. How do we see the distribution? We can attach a histogram to each axis with the marginal_x and marginal_y parameters.
fig = px.scatter(df,
x='math',
y='reading',
color ='gender',
marginal_x='histogram',
marginal_y='histogram',
template='plotly_white',
title="Does a certain gender excel in a certain subject?")
fig.show(renderer="notebook")
Don't be fooled by the colors: male is usually colored blue, but here it is switched. You can change it if you want, but for simplicity's sake male stays red and female blue. As you can see from the scatter plot, males do better in math while females excel in reading. From the marginal plots we can also see that most females score only around average in math, while males mostly score below average in reading.
So the answer to our question is yes: a certain gender excels in a certain subject.
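A quick numeric check can back up what the plot suggests. The sketch below computes mean scores per gender on made-up numbers, so it says nothing about the real data; in the notebook you would run the same groupby on df.

```python
import pandas as pd

# Made-up scores for illustration only; they prove nothing about the real dataset.
toy = pd.DataFrame({
    "gender": ["female", "female", "male", "male"],
    "math": [60, 70, 72, 78],
    "reading": [80, 76, 64, 70],
})

# Mean score per gender for the two subjects being compared.
mean_scores = toy.groupby("gender")[["math", "reading"]].mean()
print(mean_scores)
```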
Is there a specific ethnicity that is better at math?
Well, there has always been a myth that a certain ethnicity is better at math. Let's see if that assumption holds. We can use another type of plot, the box plot, and as you can guess the function is box. First, let's confirm our answer to the first question with a box plot of the math subject.
fig = px.box(df,
x='gender',
y='math',
template='plotly_white')
fig.show(renderer="notebook")
As you can see, males have a higher median than females. That's the easiest way to see whether a certain category performs better with a box plot: just compare the medians. Next, let's try to answer our question. In this box plot I added one more parameter, notched, to help us see more clearly where the median is.
fig = px.box(df,
x='ethnicity',
y='math',
template='plotly_white',
category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
title="Is there a specific ethnicity that is better at math?",
notched=True)
fig.show(renderer="notebook")
There are quite a lot of insights we can take from a box plot, such as whether the data has many outliers, or how large its variance is. The points outside the whiskers are outliers, and the size of the box shows the spread of the data: the longer the box, the bigger the variance. But we don't need that to answer our question; we just need to see where the median is.
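To make the idea of an outlier concrete, here is a minimal sketch assuming the conventional 1.5 x IQR rule for the whiskers. The scores are made up, and the way pandas interpolates quartiles may differ from the plotting library, so the flagged points might not match the plot exactly.

```python
import pandas as pd

# Made-up math scores for illustration only.
scores = pd.Series([8, 55, 60, 62, 65, 68, 70, 72, 75, 80], name="math")

# The box spans the first to the third quartile; its height is the IQR.
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

# Assumed rule: points beyond 1.5 * IQR from the box count as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = scores[(scores < lower_fence) | (scores > upper_fence)]
print(outliers.tolist())
```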
As we can see, a certain ethnicity group has a much higher median, so it's safe to assume that yes, there is a specific ethnicity that is better at math.
Even though we have already answered our initial question, let's explore the data further by also grouping it by gender. We can do that by adding the facet_col parameter. Yes, we can use that parameter on the other plots too; one of the advantages of Plotly Express is that the plot types share mostly the same parameters. It's consistent.
fig = px.box(df,
x='ethnicity',
y='math',
color = 'gender',
template='plotly_white',
notched=True,
category_orders={'ethnicity':["group A","group B","group C","group D","group E"]},
facet_col = 'gender',
title="Is there a specific ethnicity and gender that is better at math?")
fig.show(renderer="notebook")
After further investigation, the ones that truly excel in math are males of ethnicity group E. In the other groups, females and males have similar medians; group E is clearly higher than the other groups, while females in group A have a much lower median. But if you look at the maximum score (100), besides group E only males in groups A and D reach 100, so it's hard to say group A is the worst at this subject.
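The medians and maximum scores read off the faceted box plot can also be tabulated directly. The sketch below does this with a two-key groupby on made-up scores; in the notebook the same call would run on df.

```python
import pandas as pd

# Made-up scores for illustration only; not the real dataset.
toy = pd.DataFrame({
    "ethnicity": ["group A"] * 4 + ["group E"] * 4,
    "gender": ["female", "female", "male", "male"] * 2,
    "math": [50, 60, 55, 100, 70, 80, 85, 95],
})

# Median and maximum math score for every (ethnicity, gender) pair.
summary = toy.groupby(["ethnicity", "gender"])["math"].agg(["median", "max"])
print(summary)
```

Looking at the median and the maximum side by side shows how a group can have a modest median and still contain a top scorer.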
The answer is yes for both questions. There are more plots you can create with Plotly Express, but these three are the most useful for answering questions; most questions can be answered with just these three. Perhaps the other one that is always useful is the heatmap. You can always read more in the documentation, which I attach in the references.
In the end, I hope this article helps and makes you want to use this library. But the question is: will you present this notebook to your supervisor? Maybe yes, when time is very limited, but in the long run you can't just show a notebook. You will need a dashboard to support your presentation, so I suggest you also learn about Dash. That library is made for building dashboards purely in Python. It is based on Flask, so it's really lightweight, and it also fully supports plotly-express. Feel free to reach me at mentor@algorit.ma or handoyo@algorit.ma if you have more questions or are interested to know more about this subject.
And of course those are not the only insights you can draw from this dataset. You can try to draw more insights by answering the questions I give in the quiz. Have fun exploring this data, and thank you for reading!